Abstract The output of an association rule miner is often huge in practice. This is 
why several concise lossless representations have been proposed, such as the "es- 
sential" or "representative" rules. We revisit the algorithm given by Kryszkiewicz 
(Int. Symp. Intelligent Data Analysis 2001, Springer-Verlag LNCS 2189, 350-359) 
for mining representative rules. We show that its output is sometimes incomplete, 
due to an oversight in its mathematical validation. We propose alternative complete 
generators and we extend the approach to an existing closure-aware basis similar to, 
and often smaller than, the representative rules, namely the basis $\mathcal{B}^{\star}_{\gamma}$. 



1 Introduction 

Association rule mining is among the most popular conceptual tools in the field 
of Data Mining. We are interested in the process of discovering and representing 
regularities between sets of items in large scale transactional data. Syntactically, the 
association rule representation has the form of an implication, $X \to Y$; however, 
whereas in Logic such an expression is true if and only if $Y$ holds whenever $X$ does, 
an association rule is a partial implication, in the sense that it is enough if Y holds 
most of the times X does. 

To endow association rules with a definite semantics, we need to make precise 
how this intuition of "most of the times" is formalized. There are many proposals 
for this formalization. One of the frequently used measures of intensity of this kind 
of partial implication is its confidence: the ratio between the number of transactions 
in which X and Y are seen together and the number of transactions that contain X. 
In most application cases, the search space is additionally restricted to association 
rules that meet a minimal support criterion, thus avoiding the generation of rules 
from items that appear very seldom together in the dataset (formal definitions of 
support and confidence are given in Section 2.1). 

Many association rule miners exist, Apriori (see [Agrawal et al., 1996]) being 
one of the most widely discussed and used. The major problem shared by all mining 
algorithms is that, in practice, even for reasonable support and confidence thresh- 
olds, the output is often huge. Therefore, several concise lossless representations 
of the whole set of association rules have been proposed. These representations are 
based on different notions of "redundancy". In one of these, a rule is redundant if it is 
possible to compute exactly its confidence and support from other information such 
as the confidences and supports of other informative rules (see [Kryszkiewicz, 2002, 
Luxenburger, 1991, Hamrouni et al., 2008, Pasquier et al., 2005]); this is a quite de- 
manding property. We settle for a weaker version proposed in several works; infor- 
mally, in that version, a rule is redundant with respect to another one if its confidence 
and support are always greater, in any dataset. To avoid this redundancy, exactly one 
notion has been identified in several sources, namely the representative rules; and 
a closure-aware variant both of the redundancy notion and of the redundancy-free 
basis is given in [Balcazar, 2010a] (precise definitions and references are given be- 
low). 

We focus in this paper on the main results of [Kryszkiewicz, 2001], where a pur- 
portedly faster algorithm to construct representative rules is given, and show by an 
example that that algorithm is not guaranteed to always output all representative 
rules, because it is based on a property that does not hold in general; namely, the 
characterization of the frequent closed sets that admit a decomposition into repre- 
sentative rules misses some such sets. We propose an alternative, complete char- 
acterization, leading us to the proposal of a first alternative algorithm that is guar- 
anteed to output all the representative rules: we pre-compute, for each closed set, 
some parameters that depend on the confidence and support thresholds, and then 
use the above mentioned new characterization to generate all representative rules. 
Compared to the potentially incomplete algorithm in [Kryszkiewicz, 2001], this al- 
gorithm, guaranteed to be complete, has a main drawback: in [Kryszkiewicz, 2001], 
the internal local parameters only depend on the support threshold, but in our al- 
gorithm these parameters depend also on confidence. Therefore, each time a new 
confidence threshold is introduced by the user, the algorithm has to redo all com- 
putations. Thus, we provide a second algorithm, composed of two parts: the first 
one is a pre-processing phase, dependent only on support, in which a subdivision 
of the interval (0, 1] is associated to each closed itemset, and the second part uses 
this partition to determine, for a given value of the confidence threshold, which are 
those sets that can generate representative rules. 

Then, we extend the process to a similar basis which profits from the more pow- 
erful redundancy notions available for full-confidence implications to often obtain 
smaller bases in many applications. 

There are a couple of subtle differences between one of the usual definitions of 
association rule (the one we employ) and the one in [Kryszkiewicz, 2001]. First, 
we do allow having rules with empty antecedent (clearly, all of them have confi- 
dence equal to the normalized support of the consequent). Moreover, we do not 
require the inequalities to be strict when imposing a given support and confidence 
threshold. This is just a small detail that comes in handy when the user is interested 
in obtaining the set of all representative rules of confidence 1. However, we have 
carefully tuned all our arguments in such a way that these differences are not 
relevant; for instance, we have chosen a counterexample that invalidates Property 9 
of [Kryszkiewicz, 2001] independently of which of the two definitions is used. 

The article is structured as follows. In Section 2 we introduce the basic no- 
tions and notations that will be used throughout the paper and part of the con- 
tents of [Kryszkiewicz, 2001]; and we show that the algorithm provided there is 
not guaranteed to always provide the whole set of representative rules. In Section 3 
we define new parameters and discuss their usefulness in generating the set of all 
representative rules, providing also efficient algorithms for this task. We describe 
in Section 4 a parallel development for an alternative basis, often smaller than the 
representative rules. Section 5 contains a comparison of our approach with the one 
in [Kryszkiewicz, 2001] on some datasets. Concluding remarks and further research 
topics are presented in Section 6. 



2 Preliminaries 

A given set of available items $\mathcal{U}$ is assumed; subsets of it are called itemsets. We 
will denote itemsets by capital letters from the end of the alphabet, and use juxtaposition 
to denote union, as in $XY$. The inclusion sign as in $X \subset Y$ denotes proper 
subset, whereas improper inclusion is denoted $X \subseteq Y$. For a given dataset $\mathcal{D}$ consisting 
of $n$ transactions, each of which is an itemset labeled with a unique transaction 
identifier, we define the support $sup(X)$ of an itemset $X$ as the ratio between the 
cardinality of the set of transactions that contain $X$ and the total number of transactions 
$n$. An itemset $X$ is called frequent if its support is greater than or equal to some 
user-defined threshold $\tau \in (0, 1]$. We denote by $F_\tau = \{X \subseteq \mathcal{U} \mid sup(X) \ge \tau\}$ the set 
of all frequent itemsets. 

Given a set $X \subseteq \mathcal{U}$, the closure $\overline{X}$ of $X$ is the maximal set (with respect to set 
inclusion) $Y \subseteq \mathcal{U}$ such that $X \subseteq Y$ and $sup(X) = sup(Y)$. It is easy to see that $\overline{X}$ is 
uniquely defined. We say that a set $X \subseteq \mathcal{U}$ is closed if $\overline{X} = X$. 
Closure operators are characterized by the three properties of extensivity: $X \subseteq \overline{X}$; 
idempotency: $\overline{\overline{X}} = \overline{X}$; and monotonicity: $\overline{X} \subseteq \overline{Y}$ if $X \subseteq Y$. Moreover, intersections of 
closed sets are closed. The empty set is closed if and only if no item appears in each 
and every transaction. 
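
As an illustration only, support and closure can be sketched in a few lines of Python. The six transactions below are an assumption: they are the reading of the running example's dataset (Table 1) that agrees with the closed sets listed in Example 1. The function names are ours, not the paper's.

```python
from fractions import Fraction

# Assumed toy dataset: six transactions over the items a..f.
DATA = [frozenset(t) for t in ("abcdef", "abcde", "ab", "ac", "bc", "ad")]

def sup(x):
    """Normalized support: fraction of transactions that contain itemset x."""
    x = frozenset(x)
    return Fraction(sum(1 for t in DATA if x <= t), len(DATA))

def closure(x):
    """Largest superset of x with the same support, computed as the
    intersection of all transactions containing x.  For an itemset that
    occurs nowhere we simply return x (a convention of this sketch)."""
    x = frozenset(x)
    covering = [t for t in DATA if x <= t]
    return frozenset.intersection(*covering) if covering else x

def is_closed(x):
    return closure(x) == frozenset(x)
```

Exact fractions are used so that later comparisons against thresholds are not affected by floating-point rounding.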

A minimal generator is a set X for which all proper subsets have closures dif- 
ferent from the closure of X (equivalently, X is a minimal generator if and only if 
$sup(Y) > sup(X)$ for all $Y \subset X$). 

Also, $FC_\tau = \{X \in F_\tau \mid \overline{X} = X\}$ represents the set of all frequent closed sets, 
and $FG_\tau = \{X \in F_\tau \mid \forall Y \subset X, sup(Y) > sup(X)\}$ is the set of all frequent minimal 
generators. Note that $FC_\tau$ constitutes a concise lossless representation of frequent 
itemsets, since knowing the support of all sets in $FC_\tau$ is enough to retrieve the 
support of all sets in $F_\tau$. 

Example 1. Let $\mathcal{D}$ be the dataset represented in Table 1 where the universe $\mathcal{U}$ of 
attributes is $\{a,b,c,d,e,f\}$, and consider the threshold $\tau = 0.15$. Clearly, all subsets 
of $\mathcal{U}$ are frequent, $FC_\tau = \{\emptyset,a,b,c,ab,ac,ad,bc,abcde,abcdef\}$ and $FG_\tau = 
\{\emptyset,a,b,c,d,e,f,ab,ac,bc,bd,cd,abc\}$ (we abuse the notation and denote sets by 
the juxtaposition of their constituent elements). 

Table 1 Dataset $\mathcal{D}$ 
a b c d e f 
1 1 1 1 1 1 
1 1 1 1 1 0 
1 1 0 0 0 0 
1 0 1 0 0 0 
0 1 1 0 0 0 
1 0 0 1 0 0 
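
The sets $FC_\tau$ and $FG_\tau$ of Example 1 can be recomputed by brute force. The following sketch assumes the six transactions abcdef, abcde, ab, ac, bc, ad (the reading of Table 1 consistent with Example 1) and relies on the monotonicity of support: it is enough to test single-item extensions for closedness and single-item removals for minimal generators. Itemsets are named by their sorted items, with the empty string for the empty set.

```python
from itertools import combinations

ITEMS = "abcdef"
DATA = [frozenset(t) for t in ("abcdef", "abcde", "ab", "ac", "bc", "ad")]
TAU = 0.15

def count(x):
    """Number of transactions containing itemset x."""
    return sum(1 for t in DATA if x <= t)

def all_itemsets():
    return [frozenset(c) for r in range(len(ITEMS) + 1)
            for c in combinations(ITEMS, r)]

def name(x):
    return "".join(sorted(x))

frequent = [x for x in all_itemsets() if count(x) / len(DATA) >= TAU]
# Closed: every single-item extension strictly lowers the support.
closed = [x for x in frequent
          if all(count(x | {i}) < count(x) for i in ITEMS if i not in x)]
# Minimal generator: dropping any single item strictly raises the support.
generators = [x for x in frequent
              if all(count(x - {i}) > count(x) for i in x)]
```

Exhaustive enumeration is only meant for toy universes; a real miner would use a dedicated closed-set algorithm.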



2.1 Association Rules and Representative Rules 

Given $X$ in $F_\tau$, the following two notions were introduced in [Kryszkiewicz, 2001] 
(with longer names): 

$$mxs_\tau(X) = \max(\{sup(Z) \mid Z \in FC_\tau, Z \supset X\} \cup \{0\}),$$
$$mns_\tau(X) = \min(\{sup(Y) \mid Y \in FG_\tau, Y \subset X\} \cup \{\infty\}).$$

That is, $mxs_\tau(X)$ represents the maximum support of all proper frequent closed 
supersets of $X$, and $mns_\tau(X)$ is the minimum support of minimal generators that are 
proper subsets of $X$. The extra $0$ and $\infty$ are added in order to make sure that $mxs_\tau(X)$ 
and $mns_\tau(X)$ are defined even for the cases in which $X$ has no proper supersets that 
are frequent and closed, or when it does not have proper subsets that are minimal 
generators. It is easy to check that $mxs_\tau(X) \le sup(X) \le mns_\tau(X)$. Moreover, in 
[Kryszkiewicz, 2001] it is shown that: 

Proposition 1. Given $\tau \in (0, 1]$ and an itemset $X \in F_\tau$, $X$ is closed if and only if 
$sup(X) > mxs_\tau(X)$, and $X$ is a minimal generator if and only if $sup(X) < mns_\tau(X)$. 
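
A small sketch may help to see Proposition 1 at work. It assumes the six-transaction reading of Table 1 and works on support counts rather than normalized supports (the comparisons are unaffected); `mxs` and `mns` are our own helper names for the two bounds.

```python
from itertools import combinations
from math import inf

ITEMS = "abcdef"
DATA = [frozenset(t) for t in ("abcdef", "abcde", "ab", "ac", "bc", "ad")]

def count(x):
    return sum(1 for t in DATA if x <= t)

ALL = [frozenset(c) for r in range(len(ITEMS) + 1)
       for c in combinations(ITEMS, r)]
# With tau = 0.15 and six transactions, "frequent" means count >= 1.
FREQ = [x for x in ALL if count(x) >= 1]
FC = [x for x in FREQ
      if all(count(x | {i}) < count(x) for i in ITEMS if i not in x)]
FG = [x for x in FREQ if all(count(x - {i}) > count(x) for i in x)]

def mxs(x):
    """Largest support count among frequent closed proper supersets (0 if none)."""
    return max([count(z) for z in FC if x < z] + [0])

def mns(x):
    """Smallest support count among minimal generators properly inside x."""
    return min([count(y) for y in FG if y < x] + [inf])
```

On this dataset, comparing `count(x)` with `mxs(x)` and `mns(x)` recovers exactly the closed sets and the minimal generators computed directly.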

The association rules considered in this work are implications of the form $X \to Y$, 
where $X, Y \subseteq \mathcal{U}$, $Y \neq \emptyset$ and $X \cap Y = \emptyset$. In [Kryszkiewicz, 2001], rules with 
$X = \emptyset$ are disallowed, but we do permit them as in practice such rules often play 
a useful role related to coverings, described below. The confidence of $X \to Y$ is 
$conf(X \to Y) = sup(XY) / sup(X)$, and its support is $sup(X \to Y) = sup(XY)$. The 
problem of mining association rules consists in generating all rules that meet the 
minimum support and confidence threshold criteria, i.e. enumerating the following 
set: $AR_{\tau,\gamma} = \{X \to Y \mid sup(X \to Y) \ge \tau, conf(X \to Y) \ge \gamma\}$. 

Since the whole set of association rules is quite big in real-world applica- 
tions, a number of formalizations of the notion of redundancy among associ- 
ation rules have been introduced (see [Aggarwal and Yu, 2001, Balcazar, 2010a, 
Kryszkiewicz, 1998b, Pasquier et al., 2005, Phan-Luong, 2001, Luxenburger, 1991, 
Zaki, 2004, Cristofor and Simovici, 2002], the survey [Kryszkiewicz, 2002], and 
Section 6 of [Ceglar and Roddick, 2006]). In one common approach, the cover 
set $C(X \to Y)$ of a rule $X \to Y$ is defined by $C(X \to Y) = \{X' \to Y' \mid X \subseteq 
X' \text{ and } X'Y' \subseteq XY\}$. Such rules $X' \to Y'$ are redundant with respect to $X \to Y$ in 
the following sense (see [Aggarwal and Yu, 2001, Kryszkiewicz, 1998b] and also 
[Kryszkiewicz, 1998a, Balcazar, 2010a, Phan-Luong, 2001]): 

Proposition 2. Let $r$, $r'$ be association rules. Then $r' \in C(r)$ implies $sup(r') \ge sup(r)$ 
and $conf(r') \ge conf(r)$. 
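
To make the cover set concrete, here is a minimal sketch that enumerates $C(X \to Y)$ and lets one check Proposition 2 on a toy dataset. The six transactions are the assumed reading of Table 1, and `cover` and `conf` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import combinations

DATA = [frozenset(t) for t in ("abcdef", "abcde", "ab", "ac", "bc", "ad")]

def count(x):
    return sum(1 for t in DATA if x <= t)

def subsets(s):
    s = sorted(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def conf(x, y):
    """Confidence of the rule x -> y, as an exact fraction."""
    return Fraction(count(x | y), count(x))

def cover(x, y):
    """All rules x2 -> y2 with x <= x2 and x2 | y2 <= x | y (y2 non-empty).
    The rule x -> y itself is included."""
    x, y = frozenset(x), frozenset(y)
    rules = []
    for moved in subsets(y):          # items of y moved into the antecedent
        x2 = x | moved
        for y2 in subsets(y - moved):
            if y2:
                rules.append((x2, y2))
    return rules
```

For instance, `cover("a", "bd")` contains the rule a -> d, which is the redundancy used in Example 2.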

In fact, this implication is a full characterization, that is, if r' has always at least 
the same confidence and at least the same support as r then it must belong to the 
cover set. Avoiding such redundancies leads to the set $RR_{\tau,\gamma}$ of representative association 
rules. A rule $r$ in $AR_{\tau,\gamma}$ is said to be representative, or essential, if it is not 
contained in the cover set of any other rule in $AR_{\tau,\gamma}$, i.e. 

$$RR_{\tau,\gamma} = \{r \in AR_{\tau,\gamma} \mid \forall r' \in AR_{\tau,\gamma}\ (r \in C(r') \Rightarrow r = r')\}.$$
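Proposition 3. The following properties hold: 

• $RR_{\tau,\gamma} = \{X \to Y \in AR_{\tau,\gamma} \mid \neg\exists X' \to Y' \in AR_{\tau,\gamma}\ (X = X', XY \subset X'Y') \text{ or } (X' \subset 
X, XY = X'Y')\}$ 

• if $X \to Z \setminus X$ with $X \subset Z$ is in $RR_{\tau,\gamma}$ then $Z \in FC_\tau$ and $X \in FG_\tau$. 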

Therefore, any algorithm that aims at the discovery of all representative rules 
should consider only rules of the form $X \to Z \setminus X$ with $X \subset Z$, $Z \in FC_\tau$ and $X \in FG_\tau$. 
Clearly, not all sets in $FC_\tau$ can be decomposed in such a way, and one should look 
only into those that do. 

Example 2. Consider the dataset in Example 1. The set $ad$ is both frequent and 
closed, but none of the rules $a \to d$, $d \to a$ or $\emptyset \to ad$ is representative given the 
thresholds $\tau = 0.15$ and $\gamma = 0.33$: $a \to d$ is in the cover set of $a \to bd$, $d \to a$ is in 
the cover set of $d \to ab$ and $\emptyset \to ad$ is in the cover set of $\emptyset \to abd$. Also, it is easy to 
check that, at $\tau = 0.15$ and $\gamma = 0.4$, one can obtain representative rules exactly out 
of the following closed sets: $ab$, $ac$, $ad$, $bc$, $abcde$, and $abcdef$. 
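
The claims of Example 2 can be checked by brute force straight from the definition of representative rule. The sketch below is not an efficient miner: it assumes the six-transaction reading of Table 1, represents a rule $X \to Z \setminus X$ by the pair $(X, Z)$, and uses exact fractions so that borderline confidences such as 2/5 compare correctly against the threshold.

```python
from fractions import Fraction
from itertools import combinations

ITEMS = "abcdef"
DATA = [frozenset(t) for t in ("abcdef", "abcde", "ab", "ac", "bc", "ad")]
N = len(DATA)

def count(x):
    return sum(1 for t in DATA if x <= t)

def subsets(s):
    s = sorted(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def association_rules(tau, gamma):
    """All pairs (X, Z), X a proper subset of Z, meeting both thresholds."""
    rules = set()
    for z in subsets(ITEMS):
        if z and Fraction(count(z), N) >= tau:
            for x in subsets(z):
                if x != z and Fraction(count(z), count(x)) >= gamma:
                    rules.add((x, z))
    return rules

def representative_rules(tau, gamma):
    """Rules of AR that lie in the cover set of no other rule of AR."""
    ar = association_rules(tau, gamma)

    def covered(r, s):      # r is in the cover set of s
        return s[0] <= r[0] and r[1] <= s[1]

    return {r for r in ar if not any(s != r and covered(r, s) for s in ar)}

def show(rule):
    x, z = rule
    return "".join(sorted(x)) + "->" + "".join(sorted(z - x))
```

Calling `representative_rules(Fraction(15, 100), Fraction(2, 5))` and collecting the second component of each pair gives the closed sets from which representative rules arise.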

So, if we denote by $RI_{\tau,\gamma}$ the set of all frequent closed itemsets from which at 
least one representative rule can be generated, one possible approach to representative 
rule mining is to synthesize first the set $RI_{\tau,\gamma}$, and then, for each element $Z$ 
in $RI_{\tau,\gamma}$, to find non-empty subsets $X$ such that $X \to Z \setminus X$ is representative. This is 
precisely the idea behind Algorithm GenRR in [Kryszkiewicz, 2001]. The problem 
there is that the characterization of the set $RI_{\tau,\gamma}$ given by Property 9 of the same 
paper (on page 355) is incorrect, possibly leaving out some of the sets that can 
lead to representative rules. Namely, it is stated that $RI_{\tau,\gamma} = \{X \in FC_\tau \mid sup(X) \ge 
\gamma \cdot mns_\tau(X) > mxs_\tau(X)\}$; right-to-left inclusion indeed holds, but equality does not 
hold in general, as one can see from the following counterexample. 

Example 3. Consider the itemset $X = abcde$ in Example 1, and assume $\tau = 0.15$ 
and $\gamma = 0.4$. Let us verify that $abcde \in RI_{\tau,\gamma} \setminus \{X \in FC_\tau \mid sup(X) \ge \gamma \cdot mns_\tau(X) > 
mxs_\tau(X)\}$. Clearly, the rule $b \to acde$ is in $AR_{\tau,\gamma}$, having support 2/6 and confidence 
0.5. Moreover, by extending the right-hand side or moving the item $b$ to the right-hand 
side we get only the rules $b \to acdef$, $\emptyset \to abcde$ and $\emptyset \to abcdef$ of confidence 
1/4, 2/6 and 1/6, respectively. Hence, we can conclude that $b \to acde \in RR_{\tau,\gamma}$. 
On the other hand, $mxs_\tau(X) = 1/6$ and $mns_\tau(X) = 2/6$, so $\gamma \cdot mns_\tau(X) = 0.8/6$ is 
strictly smaller than $mxs_\tau(X)$. In this case, Algorithm GenRR does not work correctly 
since it does not list the rule $b \to acde$ as being representative. 
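
The arithmetic of this counterexample is easy to replay. The sketch below assumes the six-transaction reading of Table 1 and uses support counts (out of 6); `property9_selects` is our name for the test $sup(X) \ge \gamma \cdot mns_\tau(X) > mxs_\tau(X)$ and is not taken from [Kryszkiewicz, 2001].

```python
from fractions import Fraction
from itertools import combinations
from math import inf

ITEMS = "abcdef"
DATA = [frozenset(t) for t in ("abcdef", "abcde", "ab", "ac", "bc", "ad")]

def count(x):
    return sum(1 for t in DATA if x <= t)

ALL = [frozenset(c) for r in range(len(ITEMS) + 1)
       for c in combinations(ITEMS, r)]        # every itemset is frequent here
FC = [x for x in ALL
      if all(count(x | {i}) < count(x) for i in ITEMS if i not in x)]
FG = [x for x in ALL if all(count(x - {i}) > count(x) for i in x)]

def mxs(x):
    return max([count(z) for z in FC if x < z] + [0])

def mns(x):
    return min([count(y) for y in FG if y < x] + [inf])

def conf(x, z):
    """Confidence of the rule x -> z minus x, for x a subset of z."""
    return Fraction(count(z), count(x))

def property9_selects(x, gamma):
    """The (incomplete) test: sup(X) >= gamma * mns(X) > mxs(X)."""
    return count(x) >= gamma * mns(x) > mxs(x)
```

With `X = frozenset("abcde")` and `gamma = Fraction(2, 5)` the test fails, although the rule b -> acde has confidence 1/2 and none of its three competitors reaches 0.4.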

An alternative counterexample is given in the proof of Lemma 1 below. 



3 Characterizing Representative Rules 

The goal of pruning off sets that do not give representative rules, by keeping only 
$RI_{\tau,\gamma}$, cannot be reached using the bounds given, as we have seen that this set comprises 
all $X$ in $FC_\tau$ with $sup(X) \ge \gamma \cdot mns_\tau(X) > mxs_\tau(X)$ but may also include other 
frequent closed sets $X$ that do not satisfy the condition $\gamma \cdot mns_\tau(X) > mxs_\tau(X)$. We 
consider two alternatives. 



3.1 Closed Sets Instead of Minimal Generators 

For closed $X$, $mns_\tau(X)$ is almost the same thing as the minimal support among all 
proper subsets of $X$, or again among all proper closed subsets of $X$; all these notions 
coincide when $X$ is its own minimal generator, otherwise they only differ due to the 
minimal generators of $X$. Therefore it makes sense to try and exclude the minimal 
generators of $X$ from consideration. This way, we get another parameter, 

$$bmns_\tau(X) = \min(\{sup(Y) \mid Y \in FG_\tau, \overline{Y} \subset X\} \cup \{\infty\}).$$

The value of $bmns_\tau$ is never smaller than $mns_\tau$ as we shall shortly see. Thus, 
there will be more sets that meet the condition $\gamma \cdot bmns_\tau(X) > mxs_\tau(X)$. 

Proposition 4. The following properties hold. 

• $bmns_\tau(X) = \min(\{sup(Y) \mid Y \in FC_\tau, Y \subset X\} \cup \{\infty\})$, 

• $mns_\tau(X) \le bmns_\tau(X)$, 

• if $X \in FC_\tau \cap FG_\tau$ then $mns_\tau(X) = bmns_\tau(X)$. 

Proof. We omit the proof of the first two claims because they are straightforward. 
So, let $X$ be a frequent closed set that is also a minimal generator. If $X = \emptyset$, 
then $mns_\tau(X) = bmns_\tau(X) = \infty$. Otherwise, let $Y \in FG_\tau$ be such that $Y \subset X$ 
and $mns_\tau(X) = sup(Y)$. Clearly, $\overline{Y} \in FC_\tau$ and $\overline{Y} \subseteq \overline{X} = X$. Since $X \in FG_\tau$ and 
$Y \subset X$, $sup(Y) > sup(X)$ and hence $sup(\overline{Y}) > sup(X)$, and therefore $\overline{Y} \subset X$. We 
get $sup(Y) \ge bmns_\tau(X)$ and $mns_\tau(X) \ge bmns_\tau(X)$. Combining it with the fact that 
$mns_\tau(X) \le bmns_\tau(X)$ always holds, we conclude that $mns_\tau(X) = bmns_\tau(X)$. □ 

Unfortunately, the new parameter can still leave out some sets in $RI_{\tau,\gamma}$. 

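Lemma 1. $RI_{\tau,\gamma} \not\subseteq \{X \in FC_\tau \mid sup(X) \ge \gamma \cdot bmns_\tau(X) > mxs_\tau(X)\}$. 

Proof. Let $\mathcal{U} = \{a,b,c\}$ and $\mathcal{D}$ be the dataset containing the following 13 transactions: 
$t_1 = \cdots = t_8 = abc$, $t_9 = ab$, $t_{10} = t_{11} = t_{12} = a$, $t_{13} = b$; assume $\tau = 0.07$ 
and $\gamma = 0.7$. One can check that, although $ab \in RI_{\tau,\gamma}$ (since $a \to b \in RR_{\tau,\gamma}$), both 
$bmns_\tau(ab) = 10/13$ and $mns_\tau(ab) = 10/13$; but $\gamma \cdot mns_\tau(ab) = \gamma \cdot bmns_\tau(ab) = 
7/13 < 8/13 = mxs_\tau(ab)$. □ 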

The next construction shows that by using $bmns_\tau$ instead of $mns_\tau$ we can even 
leave out some sets in $RI_{\tau,\gamma}$ that would not have been left out otherwise. 

Lemma 2. $RI_{\tau,\gamma} \cap \{X \in FC_\tau \mid sup(X) \ge \gamma \cdot mns_\tau(X) > mxs_\tau(X)\} \not\subseteq \{X \in FC_\tau \mid 
sup(X) \ge \gamma \cdot bmns_\tau(X) > mxs_\tau(X)\}$. 

Proof. Let $\mathcal{U} = \{a,b,c,d,e\}$ and $\mathcal{D}$ be a dataset containing 35 transactions: $t_1 = 
t_2 = abcde$, $t_3 = t_4 = t_5 = abcd$, $t_6 = \cdots = t_{20} = a$ and $t_{21} = \cdots = t_{35} = b$. Pick $\tau = 0.05$ 
and $\gamma = 0.75$. Note that $ab \to cd \in RR_{\tau,\gamma}$, and therefore $abcd \in RI_{\tau,\gamma}$. Now, 
$mns_\tau(abcd) = 5/35$, $bmns_\tau(abcd) = 20/35$, $sup(abcd) = 5/35$ and $mxs_\tau(abcd) = 
2/35$. Although $\gamma \cdot mns_\tau(abcd) = 3.75/35 \approx 0.107$ belongs to the interval $(2/35, 5/35]$, 
$\gamma \cdot bmns_\tau(abcd) = 15/35$ does not. □ 
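
The numbers in the proof of Lemma 2 can be recomputed mechanically. In this sketch the 35 transactions are built as described in the proof, all quantities are support counts (out of 35), and `bmns` is computed through closed proper subsets, i.e. through the first claim of Proposition 4; the helper names are ours.

```python
from fractions import Fraction
from itertools import combinations
from math import inf

ITEMS = "abcde"
DATA = [frozenset(t)
        for t in ["abcde"] * 2 + ["abcd"] * 3 + ["a"] * 15 + ["b"] * 15]
N = len(DATA)
TAU = Fraction(1, 20)          # 0.05

def count(x):
    return sum(1 for t in DATA if x <= t)

ALL = [frozenset(c) for r in range(len(ITEMS) + 1)
       for c in combinations(ITEMS, r)]
FREQ = [x for x in ALL if Fraction(count(x), N) >= TAU]
FC = [x for x in FREQ
      if all(count(x | {i}) < count(x) for i in ITEMS if i not in x)]
FG = [x for x in FREQ if all(count(x - {i}) > count(x) for i in x)]

def mxs(x):
    return max([count(z) for z in FC if x < z] + [0])

def mns(x):
    return min([count(y) for y in FG if y < x] + [inf])

def bmns(x):
    """Minimum support count over closed proper subsets of x."""
    return min([count(y) for y in FC if y < x] + [inf])

def conf(x, z):
    x, z = frozenset(x), frozenset(z)
    return Fraction(count(z), count(x))
```

For `X = frozenset("abcd")` and $\gamma = 3/4$ this gives $\gamma \cdot mns = 15/4$ (that is, 3.75 out of 35), which lies between $mxs = 2$ and $sup = 5$, while $\gamma \cdot bmns = 15$ does not.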



3.2 Minimal Generators of Bounded Support 

In order to give a complete characterization for the set $RI_{\tau,\gamma}$, let us first introduce the 
following notation: for a set $X$ in $FC_\tau$, $mxgs_{\tau,\gamma}(X)$ is the maximal support of those 
minimal generators that are included in $X$ and are not more frequent than $sup(X)/\gamma$: 

$$mxgs_{\tau,\gamma}(X) = \max(\{sup(Y) \mid Y \in FG_\tau, Y \subset X, \gamma \cdot sup(Y) \le sup(X)\} \cup \{0\}).$$

Note that $mxgs_{\tau,\gamma}(X)$ is either 0, or it is greater than or equal to $sup(X)$. We prove 
two propositions that explain how we can use this value in order to compute the set 
$RI_{\tau,\gamma}$ and how to find, given $X \in RI_{\tau,\gamma}$, a subset $X_0 \subset X$ such that $X_0 \to X \setminus X_0 \in RR_{\tau,\gamma}$. 

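Proposition 5. The following equality holds. 

$$RI_{\tau,\gamma} = \{X \in FC_\tau \mid \gamma \cdot mxgs_{\tau,\gamma}(X) > mxs_\tau(X)\}.$$

Proof. Let $X$ be an arbitrary set in $RI_{\tau,\gamma}$, and take $X_0$ in $FG_\tau$ such that $X_0 \subset X$ and 
$X_0 \to X \setminus X_0 \in RR_{\tau,\gamma}$. 

We have, on one hand, $conf(X_0 \to X \setminus X_0) \ge \gamma$, and on the other hand, the rule 
should not be in the cover set of any other rule with confidence at least $\gamma$, i.e. 
$conf(X_0 \to Z \setminus X_0) < \gamma$ for all $Z \in FC_\tau$ with $Z \supset X$. 

That is, $sup(X) \ge \gamma \cdot sup(X_0) > sup(Z)$ for all $Z \in FC_\tau$ with $Z \supset X$. From the 
first inequality, we deduce that $X_0$ meets all the conditions in order to be considered 
for the computation of $mxgs_{\tau,\gamma}(X)$, and therefore, $mxgs_{\tau,\gamma}(X) \ge sup(X_0)$. From the 
second, we get $\gamma \cdot sup(X_0) > mxs_\tau(X)$. We conclude that $\gamma \cdot mxgs_{\tau,\gamma}(X) > mxs_\tau(X)$. 

Conversely, let $X \in FC_\tau$ be such that $\gamma \cdot mxgs_{\tau,\gamma}(X) > mxs_\tau(X)$. It is clear that 
$mxgs_{\tau,\gamma}(X)$ cannot be 0 (since $mxs_\tau(X) \ge 0$), so 

$$\{Y \in FG_\tau \mid Y \subset X, \gamma \cdot sup(Y) \le sup(X)\} \neq \emptyset.$$

Take $X_0 \in FG_\tau$ to be a set of maximal support that belongs to that set. Therefore, 
we have $mxgs_{\tau,\gamma}(X) = sup(X_0)$. Since $sup(X_0 \to X \setminus X_0) = sup(X) \ge \tau$ and 
$conf(X_0 \to X \setminus X_0) = sup(X)/sup(X_0) \ge \gamma$, we deduce that $X_0 \to X \setminus X_0 \in AR_{\tau,\gamma}$. Note that for 
any $Z \supset X$, $conf(X_0 \to Z \setminus X_0) = sup(Z)/sup(X_0) \le mxs_\tau(X)/mxgs_{\tau,\gamma}(X) < \gamma$. Moreover, for 
any $X_0' \subset X_0$, $sup(X_0') > sup(X_0)$ (since $X_0 \in FG_\tau$) and $\gamma \cdot sup(X_0') > sup(X)$ (due to 
the choice we have made for $X_0$). This is why $conf(X_0' \to X \setminus X_0') = sup(X)/sup(X_0') < \gamma$. We 
conclude that $X_0 \to X \setminus X_0 \in RR_{\tau,\gamma}$ and $X \in RI_{\tau,\gamma}$. □ 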
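
As a sanity check of Proposition 5, the sketch below evaluates the condition $\gamma \cdot mxgs_{\tau,\gamma}(X) > mxs_\tau(X)$ on the running example. It assumes the six-transaction reading of Table 1, takes generators that are proper subsets of $X$, and works on support counts; `mxgs` and `ri` are our own function names.

```python
from fractions import Fraction
from itertools import combinations

ITEMS = "abcdef"
DATA = [frozenset(t) for t in ("abcdef", "abcde", "ab", "ac", "bc", "ad")]

def count(x):
    return sum(1 for t in DATA if x <= t)

ALL = [frozenset(c) for r in range(len(ITEMS) + 1)
       for c in combinations(ITEMS, r)]        # all frequent at tau = 0.15
FC = [x for x in ALL
      if all(count(x | {i}) < count(x) for i in ITEMS if i not in x)]
FG = [x for x in ALL if all(count(x - {i}) > count(x) for i in x)]

def mxs(x):
    return max([count(z) for z in FC if x < z] + [0])

def mxgs(x, gamma):
    """Largest support count of a generator Y properly inside x
    with gamma * sup(Y) <= sup(x); 0 if there is none."""
    return max([count(y) for y in FG
                if y < x and gamma * count(y) <= count(x)] + [0])

def ri(gamma):
    """Closed sets selected by the condition of Proposition 5."""
    return {x for x in FC if gamma * mxgs(x, gamma) > mxs(x)}

def names(sets):
    return {"".join(sorted(s)) for s in sets}
```

At $\gamma = 0.4$ (passed as `Fraction(2, 5)`) the selected sets coincide with those listed in Example 2.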

The previous proposition characterizes unequivocally $RI_{\tau,\gamma}$. Simple arithmetic 
suffices to check that Proposition 5 identifies exactly the closed sets from which 
representative rules follow as per Example 2. However, we also need a practical 
method for identifying the set of representative rules. To this end, we give necessary 
and sufficient conditions for a subset of an itemset in $RI_{\tau,\gamma}$ to be the left-hand side 
of a representative rule (see Proposition 6). 

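Proposition 6. Let $X \in RI_{\tau,\gamma}$, $c_1 = mxs_\tau(X)/\gamma$, $c_2 = sup(X)/\gamma$ and $X_0 \subset X$. Then 
$X_0 \to X \setminus X_0 \in RR_{\tau,\gamma}$ if and only if $c_1 < sup(X_0) \le c_2 < mns_\tau(X_0)$. 

Proof. Consider $X \in RI_{\tau,\gamma}$ and $X_0 \subset X$. Clearly, $X_0 \to X \setminus X_0 \in RR_{\tau,\gamma}$ if and only if 
the rule $X_0 \to X \setminus X_0$ is in $AR_{\tau,\gamma}$ and does not belong to the cover set of any other rule 
in $AR_{\tau,\gamma}$. That is equivalent to: $sup(X) \ge \tau$, $sup(X)/sup(X_0) \ge \gamma$, $sup(X)/sup(X_0') < \gamma$ for all $X_0' \subset X_0$ 
and $sup(Z)/sup(X_0) < \gamma$ for all $Z \supset X$ that satisfy $sup(Z) \ge \tau$. 

Now, it is easy to see that: 

• $sup(X) \ge \tau$ always holds because $X \in FC_\tau$, 

• $sup(X)/sup(X_0) \ge \gamma \Leftrightarrow sup(X_0) \le c_2$, 

• $sup(Z)/sup(X_0) < \gamma$ for all frequent $Z \supset X$ $\Leftrightarrow c_1 < sup(X_0)$, 

• $sup(X)/sup(X_0') < \gamma$ for all $X_0' \subset X_0$ $\Leftrightarrow c_2 < mns_\tau(X_0)$, 

which concludes the proof. □ 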

The correctness of Algorithm 1 trivially follows from Propositions 5 and 6. 



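Algorithm 1 RR Generator 
Input: support threshold $\tau$, confidence threshold $\gamma$ 
$F_\tau = \{X \subseteq \mathcal{U} \mid sup(X) \ge \tau\}$ 
$FC_\tau = \{X \in F_\tau \mid \overline{X} = X\}$ 
$FG_\tau = \{X \in F_\tau \mid \forall Y \subset X, sup(Y) > sup(X)\}$ 
for all $X \in FG_\tau$ do 
    $mns_\tau(X) = \min(\{sup(Y) \mid Y \in FG_\tau, Y \subset X\} \cup \{\infty\})$ 
end for 
$RI_{\tau,\gamma} = \emptyset$ 
for all $X \in FC_\tau \setminus \{\emptyset\}$ do 
    $mxs_\tau(X) = \max(\{sup(Z) \mid Z \in FC_\tau, Z \supset X\} \cup \{0\})$ 
    $mxgs_{\tau,\gamma}(X) = \max(\{sup(Y) \mid Y \in FG_\tau, Y \subset X, \gamma \cdot sup(Y) \le sup(X)\} \cup \{0\})$ 
    if $\gamma \cdot mxgs_{\tau,\gamma}(X) > mxs_\tau(X)$ then 
        add $X$ to $RI_{\tau,\gamma}$ 
    end if 
end for 
for all $X \in RI_{\tau,\gamma}$ do 
    $c_1 = mxs_\tau(X)/\gamma$ 
    $c_2 = sup(X)/\gamma$ 
    $Ant = \{X_0 \in FG_\tau \mid X_0 \subset X, c_1 < sup(X_0) \le c_2 < mns_\tau(X_0)\}$ 
    for all $X_0 \in Ant$ do 
        output $X_0 \to X \setminus X_0$ 
    end for 
end for 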
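
A direct, unoptimized transcription of Algorithm 1 into Python might look as follows. It is a sketch under explicit assumptions: the frequent, closed and generator sets are found by exhaustive enumeration (feasible only for toy universes), supports are handled as counts, thresholds are expected as exact `Fraction` values, and the dataset is the six-transaction reading of Table 1. The function name `rr_generator` is ours.

```python
from fractions import Fraction
from itertools import combinations
from math import inf

ITEMS = "abcdef"
DATA = [frozenset(t) for t in ("abcdef", "abcde", "ab", "ac", "bc", "ad")]
N = len(DATA)

def count(x):
    return sum(1 for t in DATA if x <= t)

def rr_generator(tau, gamma):
    """Brute-force sketch of Algorithm 1; returns (antecedent, consequent) pairs."""
    every = [frozenset(c) for r in range(len(ITEMS) + 1)
             for c in combinations(ITEMS, r)]
    freq = [x for x in every if Fraction(count(x), N) >= tau]
    fc = [x for x in freq
          if all(count(x | {i}) < count(x) for i in ITEMS if i not in x)]
    fg = [x for x in freq if all(count(x - {i}) > count(x) for i in x)]
    mns = {x: min([count(y) for y in fg if y < x] + [inf]) for x in fg}
    rules = []
    for x in fc:
        if not x:                       # the empty set yields no rule
            continue
        mxs = max([count(z) for z in fc if x < z] + [0])
        mxgs = max([count(y) for y in fg
                    if y < x and gamma * count(y) <= count(x)] + [0])
        if gamma * mxgs > mxs:          # x belongs to RI (Proposition 5)
            c1, c2 = mxs / gamma, count(x) / gamma
            for x0 in fg:               # antecedent test of Proposition 6
                if x0 < x and c1 < count(x0) <= c2 < mns[x0]:
                    rules.append((x0, x - x0))
    return rules
```

On the running example with $\tau = 0.15$ and $\gamma = 0.4$ the function returns the same rules as an exhaustive check of the definition, including b -> acde.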



3.3 An Algorithm for Different Confidence Thresholds 

The disadvantage of Algorithm 1, compared to the one in [Kryszkiewicz, 2001], is 
that, for a given $X$ in $FC_\tau$, $mxgs_{\tau,\gamma}(X)$ depends on the confidence threshold, and 
hence it cannot be reused once $\gamma$ has changed, whereas both $mxs_\tau(X)$ and $mns_\tau(X)$ 
can be computed only once for a given value of $\tau$ and then used for different confidence 
values. On the other hand, Algorithm 1 is guaranteed not to lose representative 
rules, whereas the one in [Kryszkiewicz, 2001] risks giving incomplete output, as in 
our counterexamples above. 

Instead of computing $mxgs_{\tau,\gamma}(X)$ for each and every $\gamma$, one can find the individual 
points of the interval (0, 1] where $mxgs_{\tau,\gamma}(X)$ changes its value. Indeed, given $X$ in 
$FC_\tau \setminus \{\emptyset\}$, let $\{Y_1, \ldots, Y_{n[X]}\}$ be the set $\{Y \in FG_\tau \mid Y \subset X\}$ in descending order of 
support. It is easy to see that 

$$mxgs_{\tau,\gamma}(X) = \begin{cases} sup(Y_1), & \text{if } \gamma \le \frac{sup(X)}{sup(Y_1)}, \\ sup(Y_{i+1}), & \text{if } \gamma \in \left(\frac{sup(X)}{sup(Y_i)}, \frac{sup(X)}{sup(Y_{i+1})}\right],\ i \in \{1, \ldots, n[X]-1\}, \\ 0, & \text{otherwise.} \end{cases}$$

Let us introduce the following notation: for $i \in \{1, \ldots, n[X]\}$, $y_i[X] = sup(Y_i)$ and 
$p_i[X] = sup(X)/sup(Y_i)$. Moreover, $p_0[X] = 0$. Now, each time a new value of the 
confidence threshold $\gamma$ is given, one can decide whether a frequent closed set $X$ is 
in $RI_{\tau,\gamma}$ by simply retrieving the interval $(p_i[X], p_{i+1}[X]]$ with $i \in \{0, \ldots, n[X]-1\}$ 
to which $\gamma$ belongs (recall that in this case $mxgs_{\tau,\gamma}(X) = y_{i+1}[X]$) and then checking 
whether the inequality $\gamma \cdot y_{i+1}[X] > mxs_\tau(X)$ holds. Note that if no such $i$ exists 
(that is, whenever $\gamma$ has a value strictly greater than $p_{n[X]}[X]$), $mxgs_{\tau,\gamma}(X)$ takes the 
value 0, which makes $\gamma \cdot mxgs_{\tau,\gamma}(X)$ smaller than or equal to $mxs_\tau(X)$. 

These ideas are implemented in Algorithms 2 and 3. 
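
The breakpoint idea can be sketched as follows: a support-only preprocessing step stores, for a closed set, the sorted generator supports and the breakpoints, and a lookup step finds the interval containing a given confidence threshold by binary search. The dataset is again the assumed six-transaction reading of Table 1, supports are counts, and `preprocess`, `mxgs_lookup` and `mxgs_direct` are hypothetical names introduced here.

```python
from bisect import bisect_left
from fractions import Fraction
from itertools import combinations

ITEMS = "abcdef"
DATA = [frozenset(t) for t in ("abcdef", "abcde", "ab", "ac", "bc", "ad")]

def count(x):
    return sum(1 for t in DATA if x <= t)

EVERY = [frozenset(c) for r in range(len(ITEMS) + 1)
         for c in combinations(ITEMS, r)]
FG = [x for x in EVERY
      if count(x) and all(count(x - {i}) > count(x) for i in x)]

def preprocess(x):
    """Phase 1 (depends on support only): supports y_1 >= ... >= y_n of the
    generators properly inside x, and breakpoints p_0 = 0, p_i = sup(x)/y_i."""
    ys = sorted((count(y) for y in FG if y < x), reverse=True)
    ps = [Fraction(0)] + [Fraction(count(x), y) for y in ys]
    return ys, ps

def mxgs_lookup(table, gamma):
    """Phase 2: find i with gamma in (p_i, p_{i+1}] and return y_{i+1}."""
    ys, ps = table
    j = bisect_left(ps, gamma)          # first index with ps[j] >= gamma
    return ys[j - 1] if 0 < j < len(ps) else 0

def mxgs_direct(x, gamma):
    """Reference value computed from the definition, for comparison."""
    return max([count(y) for y in FG
                if y < x and gamma * count(y) <= count(x)] + [0])
```

Binary search is one possible way of retrieving the interval; the breakpoints are non-decreasing because the generators are sorted by descending support.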



Algorithm 2 RR Generator - preprocessing phase 
Input: support threshold $\tau$ 
$F_\tau = \{X \subseteq \mathcal{U} \mid sup(X) \ge \tau\}$ 
$FC_\tau = \{X \in F_\tau \mid \overline{X} = X\}$ 
$FG_\tau = \{X \in F_\tau \mid \forall Y \subset X, sup(Y) > sup(X)\}$ 
for all $X \in FG_\tau$ do 
    $mns_\tau(X) = \min(\{sup(Y) \mid Y \in FG_\tau, Y \subset X\} \cup \{\infty\})$ 
end for 
for all $X \in FC_\tau \setminus \{\emptyset\}$ do 
    $mxs_\tau(X) = \max(\{sup(Z) \mid Z \in FC_\tau, Z \supset X\} \cup \{0\})$ 
    $n[X] = |\{Y \in FG_\tau \mid Y \subset X\}|$ 
    let $\{Y_1, \ldots, Y_{n[X]}\}$ be the set $\{Y \in FG_\tau \mid Y \subset X\}$ in descending order of support 
    for all $i \in \{1, \ldots, n[X]\}$ do 
        $y_i[X] = sup(Y_i)$ 
        $p_i[X] = sup(X)/y_i[X]$ 
    end for 
    $p_0[X] = 0$ 
end for 



Algorithm 3 RR Generator - second phase 
Input: support threshold $\tau$, confidence threshold $\gamma$ 
$RI_{\tau,\gamma} = \emptyset$ 
for all $X \in FC_\tau \setminus \{\emptyset\}$ do 
    if $\exists i \in \{0, \ldots, n[X]-1\}$ such that $\gamma \in (p_i[X], p_{i+1}[X]]$ then 
        if $\gamma \cdot y_{i+1}[X] > mxs_\tau(X)$ then 
            add $X$ to $RI_{\tau,\gamma}$ 
        end if 
    end if 
end for 
for all $X \in RI_{\tau,\gamma}$ do 
    $c_1 = mxs_\tau(X)/\gamma$ 
    $c_2 = sup(X)/\gamma$ 
    $Ant = \{X_0 \in FG_\tau \mid X_0 \subset X, c_1 < sup(X_0) \le c_2 < mns_\tau(X_0)\}$ 
    for all $X_0 \in Ant$ do 
        output $X_0 \to X \setminus X_0$ 
    end for 
end for 



4 Characterizing the Basis for Closure-Based Redundancy 

The results of the previous sections can be extended to find a list of rules such that 
any other rule in $AR_{\tau,\gamma}$ is redundant with respect to one rule in our list and the set 
of full-confidence implications. This is exactly the idea behind a basis for closure-based 
redundancy [Balcazar, 2010a]. 

Let $\mathcal{B}$ be a set of implications, i.e. rules that hold with confidence 1. A partial 
rule $X' \to Y'$ is closure-based redundant relative to $\mathcal{B}$ with respect to $X \to Y$ if any 
dataset $\mathcal{D}$ in which all the rules in $\mathcal{B}$ hold with confidence 1 gives $conf(X' \to Y') \ge 
conf(X \to Y)$. 

Closure-based redundancy and standard redundancy coincide when the set of 
implications $\mathcal{B}$ is empty. Knowing the set $\mathcal{B}$ is equivalent to knowing how the 
closure operator works on each set. If the set of implications is empty, then any 
subset is closed and all the closure-related arguments trivialize; in particular, 
in this case the set of representative rules forms a minimum-size basis. 

In any case, we have the following characterization for closure-based redun- 
dancy: 

Theorem 1 ([Balcazar, 2010a]). Let $\mathcal{B}$ be a set of exact rules, with associated closure 
operator mapping each itemset $Z$ to its closure $\overline{Z}$. Let $X' \to Y'$ be a rule not 
implied by $\mathcal{B}$, that is, $Y' \not\subseteq \overline{X'}$; then the following are equivalent: 

1. $X \subseteq \overline{X'}$ and $X'Y' \subseteq \overline{XY}$, 

2. The rule $X' \to Y'$ is closure-based redundant relative to $\mathcal{B}$ with respect to $X \to Y$. 

Note that $Y' \not\subseteq \overline{X'}$ is equivalent to saying that $X' \to Y'$ is not a full implication. 
One can then analogously define the closure-based cover set of a rule $X \to Y$ by 
$C(X \to Y) = \{X' \to Y' \mid X \subseteq \overline{X'} \text{ and } X'Y' \subseteq \overline{XY}\}$. Accordingly, we must refine the 
notion of "different" rule since only the closures are relevant: A rule $X' \to Y'$ is 
closure-equivalent (again relative to $\mathcal{B}$) to $X \to Y$ when $\overline{X'} = \overline{X}$ and $\overline{X'Y'} = \overline{XY}$. 
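
Condition 1 of Theorem 1 is straightforward to test once a closure operator is available. The sketch below makes one assumption worth stressing: the set of implications is taken to be all implications that hold in a given dataset, so that the closure is the usual intersection of covering transactions (here, the six-transaction reading of Table 1). The helper names are ours.

```python
from fractions import Fraction

DATA = [frozenset(t) for t in ("abcdef", "abcde", "ab", "ac", "bc", "ad")]

def count(x):
    return sum(1 for t in DATA if x <= t)

def closure(x):
    """Closure induced by the dataset (x itself if no transaction contains it)."""
    covering = [t for t in DATA if x <= t]
    return frozenset.intersection(*covering) if covering else x

def conf(x, y):
    return Fraction(count(x | y), count(x))

def plain_redundant(x1, y1, x0, y0):
    """x1 -> y1 lies in the standard cover set of x0 -> y0."""
    return x0 <= x1 and (x1 | y1) <= (x0 | y0)

def closure_redundant(x1, y1, x0, y0):
    """Condition 1 of Theorem 1: x1 -> y1 is closure-based redundant
    with respect to x0 -> y0."""
    return x0 <= closure(x1) and (x1 | y1) <= closure(x0 | y0)

def closure_equivalent(x1, y1, x0, y0):
    return closure(x1) == closure(x0) and closure(x1 | y1) == closure(x0 | y0)
```

For example, d -> bce is not in the standard cover set of a -> bcde (since a is not a subset of d), yet it is closure-based redundant with respect to it because the closure of d is ad.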

The minimum-size basis $\mathcal{B}^{\star}_{\gamma}$ for closure-based redundancy contains all rules 
in $AR_{\tau,\gamma}$ of confidence strictly smaller than 1 that are not closure-based redundant 
with respect to any rule in $AR_{\tau,\gamma}$, unless they are closure-equivalent (see 
[Balcazar, 2010a] for details). Again the main property of this basis is that every 
rule in $AR_{\tau,\gamma}$ is closure-based redundant with a rule in the basis. 

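Proposition 7. If a rule is not in the basis, then it is closure-based redundant with 
respect to a rule in the basis that is not closure-equivalent to it. 

Proof. Indeed, if $X \to Y \setminus X$ is not in the basis, some rule $X' \to Y' \setminus X'$ exists above the 
confidence and support thresholds for which $X' \subseteq \overline{X}$ and $Y \subseteq \overline{Y'}$, and either $\overline{X'} \neq \overline{X}$ or 
$\overline{Y'} \neq \overline{Y}$; in turn, this rule is closure-based redundant with a rule in the basis, possibly 
itself, say $X'' \to Y'' \setminus X''$, so that $X'' \subseteq \overline{X'} \subseteq \overline{X}$ and $Y \subseteq \overline{Y'} \subseteq \overline{Y''}$; further, 
then, $\overline{X''} = \overline{X}$ implies $\overline{X'} = \overline{X}$, and $\overline{Y''} = \overline{Y}$ implies $\overline{Y'} = \overline{Y}$. Therefore, if $X \to Y \setminus X$ 
is not in the basis, then it is closure-based redundant with $X'' \to Y'' \setminus X''$, which is in 
the basis and is not closure-equivalent to it. □ 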

It is easy to check that, in all rules in this basis, the left-hand sides are also closed 
sets. We are interested in computing this basis fast. To do that, let $RI_{\tau,\gamma}$ be the set 
of all frequent closed itemsets from which at least one rule for this basis can be 
obtained. 

Proposition 8. The following equality holds. 

$$RI_{\tau,\gamma} = \{X \in FC_\tau \mid \gamma \cdot mxgs_{\tau,\gamma}(X) > mxs_\tau(X) \text{ and } mxgs_{\tau,\gamma}(X) > sup(X)\}.$$

Proof. Let $X$ be an arbitrary set in $RI_{\tau,\gamma}$: there is a basis rule $X_0 \to X \setminus X_0$ for these 
confidence and support thresholds, where $X_0$ is a proper closed subset $X_0 \subset X$. Pick 
a minimal generator $X_1$ of $X_0$; as $X_0$ is closed, $sup(X_1) = sup(X_0) > sup(X)$; as 
$conf(X_0 \to X \setminus X_0) \ge \gamma$, $\gamma \cdot sup(X_1) = \gamma \cdot sup(X_0) \le sup(X)$, hence $X_1$ participates in 
the computation of $mxgs_{\tau,\gamma}(X)$, so that $mxgs_{\tau,\gamma}(X) \ge sup(X_1) > sup(X)$. 

Besides, if there was a proper closed superset $Z$ of $X$ such that $sup(Z) \ge \tau$ 
and $conf(X_0 \to Z \setminus X_0) \ge \gamma$, then the rule $X_0 \to X \setminus X_0$ would not be in the basis due 
to redundancy with $X_0 \to Z \setminus X_0$. Therefore, the support of any frequent itemset 
$Z$ with $X \subset Z$ is less than $\gamma \cdot sup(X_0)$. That is, $mxs_\tau(X) < \gamma \cdot sup(X_0)$. Hence, 
$\gamma \cdot mxgs_{\tau,\gamma}(X) \ge \gamma \cdot sup(X_1) = \gamma \cdot sup(X_0) > mxs_\tau(X)$. 

Conversely, assume that 

$$\gamma \cdot mxgs_{\tau,\gamma}(X) > mxs_\tau(X) \text{ and } mxgs_{\tau,\gamma}(X) > sup(X)$$

holds for $X \in FC_\tau$. Indeed, $sup(X) < mxgs_{\tau,\gamma}(X)$ implies that this last value is 
not zero, and that there is at least one itemset $X_1 \in FG_\tau$ such that $X_1 \subset X$ and 
$\gamma \cdot sup(X_1) \le sup(X)$. Among these $X_1$, we pick one with maximum support: 
$mxgs_{\tau,\gamma}(X) = sup(X_1)$. Let $X_0 = \overline{X_1}$, so $sup(X_0) = sup(X_1) > sup(X)$ and $X_0 \subset X$. 
Then $conf(X_0 \to X \setminus X_0) = sup(X)/sup(X_0) \ge \gamma \cdot sup(X_1)/sup(X_0) = \gamma$, which implies 
$X_0 \to X \setminus X_0 \in AR_{\tau,\gamma}$. 

Suppose, for a contradiction, that $X_0 \to X \setminus X_0$ is not in the basis. By Proposition 7, 
it must be closure-based redundant with respect to a rule $Y \to Z \setminus Y$ that is in the 
basis and is not closure-equivalent to it. Being in the basis implies that $Y, Z \in FC_\tau$ 
(and keep in mind that both $X_0$ and $X$ are closed as well). By Theorem 1, we have 
that $Y \subseteq X_0$ and $X \subseteq Z$, where one of the two inclusions must be proper to ensure 
closure-inequivalence. If $X \subset Z$, we have that 

$$conf(Y \to Z \setminus Y) = \frac{sup(Z)}{sup(Y)} \le \frac{sup(Z)}{sup(X_0)} \le \frac{mxs_\tau(X)}{mxgs_{\tau,\gamma}(X)} < \gamma,$$

which is a contradiction with $conf(Y \to Z \setminus Y) \ge \gamma$ as $Y \to Z \setminus Y \in \mathcal{B}^{\star}_{\gamma} \subseteq AR_{\tau,\gamma}$. The 
other possibility is that $Z = X$ and $Y \subset X_0$, but $sup(Y) > sup(X_0)$, because $Y \in FC_\tau$, 
contradicting the maximality of $sup(X_0)$. This finishes the proof. □ 
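
The extra condition $mxgs_{\tau,\gamma}(X) > sup(X)$ of Proposition 8 is what discards full-confidence rules. The following sketch contrasts the two selection conditions (Propositions 5 and 8) on the assumed six-transaction reading of Table 1; it uses support counts and our own helper names `ri_rr` and `ri_basis`.

```python
from fractions import Fraction
from itertools import combinations

ITEMS = "abcdef"
DATA = [frozenset(t) for t in ("abcdef", "abcde", "ab", "ac", "bc", "ad")]

def count(x):
    return sum(1 for t in DATA if x <= t)

ALL = [frozenset(c) for r in range(len(ITEMS) + 1)
       for c in combinations(ITEMS, r)]        # all frequent at tau = 0.15
FC = [x for x in ALL
      if all(count(x | {i}) < count(x) for i in ITEMS if i not in x)]
FG = [x for x in ALL if all(count(x - {i}) > count(x) for i in x)]

def mxs(x):
    return max([count(z) for z in FC if x < z] + [0])

def mxgs(x, gamma):
    return max([count(y) for y in FG
                if y < x and gamma * count(y) <= count(x)] + [0])

def ri_rr(gamma):
    """Selection condition of Proposition 5 (representative rules)."""
    return {x for x in FC if gamma * mxgs(x, gamma) > mxs(x)}

def ri_basis(gamma):
    """Selection condition of Proposition 8 (closure-based basis)."""
    return {x for x in FC
            if gamma * mxgs(x, gamma) > mxs(x) and mxgs(x, gamma) > count(x)}

def names(sets):
    return {"".join(sorted(s)) for s in sets}
```

At $\gamma = 0.4$ both conditions select the same closed sets on this dataset, whereas at $\gamma = 1$ the first still selects $ad$, $abcde$ and $abcdef$ (sources of full-confidence representative rules) and the second selects nothing.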



Proposition 9. Let $X \in RI_{\tau,\gamma}$, $c_1 = mxs_\tau(X)/\gamma$, and $c_2 = sup(X)/\gamma$. Consider a 
proper closed subset $X_0 \subset X$. Then $X_0 \to X \setminus X_0 \in \mathcal{B}^{\star}_{\gamma}$ if and only if $c_1 < sup(X_0) \le 
c_2 < mns_\tau(X_0)$. 

Proof. Consider $X \in RI_{\tau,\gamma}$ and a proper closed subset $X_0 \subset X$. The rule $X_0 \to X \setminus X_0$ is 
in $\mathcal{B}^{\star}_{\gamma}$ if and only if it meets the support and confidence threshold requirements with 
respect to $\tau$ and $\gamma$, it is not a full implication, and is not closure-based redundant 
with respect to another rule $Y \to Z \setminus Y$. 

First of all $sup(X) \ge \tau$, because $X \in RI_{\tau,\gamma}$, so it remains to see that: 

1. $conf(X_0 \to X \setminus X_0) \ge \gamma$, 

2. $conf(Y \to Z \setminus Y) < \gamma$ for any $Y, Z \in FC_\tau$ such that $Y \subseteq X_0$ and $X \subseteq Z$, with at least 
one of the two inclusions proper. 

The first item is equivalent to $sup(X_0) \le c_2$; for the second item we will divide the 
proof in two different steps: first, we are going to consider the case where $Y \subset X_0$ 
and $X \subseteq Z$. 

$$\forall Y \subset X_0,\ conf(Y \to Z \setminus Y) < \gamma \Leftrightarrow \frac{sup(X)}{sup(Y)} < \gamma \Leftrightarrow c_2 < mns_\tau(X_0).$$

In a similar way, we obtain that for all $Z$ such that $X \subset Z$ and $Y = X_0$, $conf(Y \to 
Z \setminus Y) < \gamma$ is equivalent to $c_1 < sup(X_0)$. This finishes the proof. □ 

All three algorithms defined so far can be modified to output the set $\mathcal{B}^{\star}_{\gamma}$ 
of closure-based irredundant partial rules. These modifications are easy from the 
results we have proven in this Section, so they are omitted. 



5 Empirical Comparison 

We have seen that one can find toy examples of datasets in which the output of the 
algorithm in [Kryszkiewicz, 2001] is incomplete. 

We have tested our algorithms on two real-world datasets: the training set part of 
the UCI Adult US census dataset (see [Asuncion and Newman, 2007]) and a Retail 
dataset (see [Brijs et al., 1999]). 

We have implemented three different algorithms: one for the incomplete heuristic 
given in [Kryszkiewicz, 2001], one that generates the complete set of representative 
rules as described by Algorithm 1, and the last algorithm outputs a complete basis 
under the notion of closure-based redundancy. In order to get comparable results, 
all algorithms allow rules with empty antecedent and use the same definition of fre- 
quent sets and association rules as given in our preliminaries. We emphasize that, in 
general, the incomplete heuristic fails to produce a complete basis of representative 
rules. The code is available at [Balcazar, 2010b]. 

The first dataset under study, which we refer to as Retail, is a market 
basket dataset consisting of 88163 transactions over 16470 attributes. In order to 
preserve the anonymity of the clients, the data has been processed so that each item 
is represented by a number and each line break separates different customers. For 
the interested reader, the paper [Brijs et al., 1999] contains more information about 
this dataset. 
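
A minimal sketch of how such a file could be read and how support and confidence are then computed is given next. It assumes that the items of one transaction are whitespace-separated integers on a single line, which is an assumption about the file layout; the function names and the four sample transactions are purely illustrative and do not come from the Retail data.

```python
import io


def read_transactions(handle):
    """Parse one transaction per line; items are assumed to be integers."""
    return [frozenset(int(tok) for tok in line.split())
            for line in handle if line.strip()]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """Ratio of transactions containing both sides to those containing the antecedent."""
    covered = support_count(antecedent, transactions)
    if covered == 0:
        raise ValueError("antecedent does not occur in the data")
    return support_count(antecedent | consequent, transactions) / covered


# Illustrative data only, not taken from the Retail dataset.
sample = io.StringIO("1 2 3\n1 2\n2 3\n1 2 3 4\n")
transactions = read_transactions(sample)

conf_12_to_3 = confidence(frozenset({1, 2}), frozenset({3}), transactions)
# Rules with empty antecedent are allowed: the antecedent holds in every transaction.
conf_empty_to_2 = confidence(frozenset(), frozenset({2}), transactions)
```

Representing each transaction as a `frozenset` makes the containment test a single subset comparison, and the empty antecedent needs no special case.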

Table 2 shows the number of representative rules obtained for different support 
and confidence thresholds (seventh column), the cardinality of the output set 
when $\mathrm{mns}_\tau$ is used (fifth column), and the time elapsed to obtain them 
(sixth and fourth columns, respectively). We can see that, although for higher 
support thresholds the outputs of the two algorithms are, most of the time, identical (recall 
that the output of the algorithm in [Kryszkiewicz, 2001] is always a subset of the 
whole set of representative rules), lowering both thresholds reveals bigger 
differences. 

Table 2 Comparison between GenRR and Algorithm 1 on the Retail dataset (times in seconds) 

        Data                       GenRR               Algorithm 1
        Support   Confidence   Time (s)   Rules    Time (s)   Rules
7573    0.1%      0.9          0.015      248      0.013      248
                  0.8          0.013      643      0.013      652
                  0.7          0.028      1978     0.026      1990
19115   0.05%     0.9          0.036      670      0.022      670
                  0.8          0.073      2228     0.041      2229
                  0.7          0.123      6029     0.083      6039



The Adult dataset is a transactional version of the training set part of the UCI census 
dataset Adult US (see [Asuncion and Newman, 2007]); it consists of 32561 
transactions over 269 items. On the Adult dataset, we see the same trend in the behavior 
of both algorithms. Note that in this case there are significant differences between 
the output of the algorithm in [Kryszkiewicz, 2001] and the set of all representative 
rules (Table 3). For example, for support and confidence thresholds of 0.5% and 0.7, 
respectively, more than half of the rules are lost. 
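
The proportion of lost rules can be checked directly from the rule counts reported in Table 3. The short computation below copies those counts; the variable and function names are ours and serve only as an illustration.

```python
# (support, confidence) -> (rules output by GenRR, rules output by Algorithm 1),
# copied from Table 3 (Adult dataset).
table3 = {
    ("1%", "0.9"): (6578, 7436),
    ("1%", "0.8"): (4827, 7379),
    ("1%", "0.7"): (3459, 6867),
    ("0.5%", "0.9"): (15208, 17573),
    ("0.5%", "0.8"): (11516, 18190),
    ("0.5%", "0.7"): (8241, 16779),
}


def lost_rules(genrr, complete):
    """Representative rules missing from the GenRR output."""
    return complete - genrr


def lost_fraction(genrr, complete):
    """Share of the complete set of representative rules that GenRR misses."""
    return lost_rules(genrr, complete) / complete


fractions_lost = {key: lost_fraction(*counts) for key, counts in table3.items()}
more_than_half = [key for key, frac in fractions_lost.items() if frac > 0.5]

for key, frac in fractions_lost.items():
    print(key, lost_rules(*table3[key]), f"{frac:.1%}")
```

Only the row with support 0.5% and confidence 0.7 exceeds one half; the row with support 1% and confidence 0.7 loses 3408 of 6867 rules, slightly less than half.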

As an example, when the thresholds for support and confidence are 1% and 
0.70, respectively, there are a total of 6867 representative rules, among which 3408 
are lost when using mns or bmns (four of them listed in bold, the rest of the rules 
are given as an example): 

[c:0.75, s:1.03] Private White age: 41 => Male, 

[c:0.82, s:2.21] Never-married Unmarried => <=50K USA, 

[c:0.70, s:1.47] <=50K Assoc-acdm White => Private, 

[c:0.75, s:3.74] Own-child Private hours-per-week: 40 <=50K Never-married USA, 

[c:0.75, s:3.74] Never-married Own-child USA hours-per-week: 40 <=50K Private, 

[c:0.87, s:1.03] Male Private age: 41 White 

[c:0.75, s:1.03] Private White age: 41 => Male 

[c:0.86, s:7.07] Exec-managerial Private USA White 

[c:0.73, s:1.04] Craft-repair Divorced => Male USA White 

[c:0.75, s:1.68] Not-in-family hours-per-week: 50 <=50K 

Table 3 Comparison between GenRR and Algorithm 1 on the Adult dataset (times in seconds) 

        Data                       GenRR               Algorithm 1
        Support   Confidence   Time (s)   Rules    Time (s)   Rules
11920   1%        0.9          0.147      6578     0.176      7436
                  0.8          0.130      4827     0.148      7379
                  0.7          0.096      3459     0.141      6867
27444   0.5%      0.9          0.391      15208    0.380      17573
                  0.8          0.298      11516    0.417      18190
                  0.7          0.263      8241     0.382      16779

As mentioned at the beginning of this section, we have also run experiments to 
measure the performance of our algorithm that finds a basis under closure-based 
redundancy. The results are in Tables 4 and 5. Notice that in this case the times 
are significantly lower. 



Table 4 Algorithm for basis $\mathcal{B}^\star_\gamma$ (Retail dataset; times in seconds) 

Support   Confidence   Time (s)   Rules
0.1%      0.9          0.006      233
          0.8          0.007      643
          0.7          0.013      1984
0.05%     0.9          0.029      549
          0.8          0.024      2139
          0.7          0.044      6039



Table 5 Algorithm for basis $\mathcal{B}^\star_\gamma$ (Adult dataset; times in seconds) 

Support   Confidence   Time (s)   Rules
1%        0.9          0.093      7103
          0.8          0.086      7205
          0.7          0.082      6662
0.5%      0.9          0.243      16457
          0.8          0.250      17531
          0.7          0.233      16085



We have run the experiments on an Intel Core i3-330M @ 2.13 GHz machine 
with 4 GB of RAM running Microsoft Windows 7 Professional (64 bits). The 
running times of all algorithms were between 6 and 123 milliseconds in the case 
of the Retail dataset and between 82 and 417 milliseconds for the Adult dataset. 
Algorithm 1 correctly outputs all representative rules at the cost of being sometimes 
slower than the possibly incomplete algorithm of Kryszkiewicz but, in our tests, the 
difference was rather irrelevant since the time needed to print the results on screen 
(a device slower than the CPU) still dominates the process. 

It must be noted that the number of representative rules may decrease at lower 
confidence or support thresholds. This phenomenon has been observed and 
explained before (see [Balcazar, 2010a]). It is caused by powerful rules of a given 
confidence, say 0.8. At higher thresholds such a rule is filtered out, which leaves many 
other rules as representative. Once the confidence threshold drops below 0.8, the 
powerful rule enters the output and makes all of those rules redundant, forcing them 
out of the set of representative rules. 
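
The effect can be reproduced on a toy dataset with a brute-force miner. The sketch below is only an illustration and is unrelated to Algorithm 1: it assumes the usual cover-based notion of redundancy (a rule X' → Y' makes X → Y redundant when X' ⊆ X and X ∪ Y ⊆ X' ∪ Y'), it allows empty antecedents, and the dataset, thresholds, and function names are all hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)


def all_subsets(items):
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def valid_rules(transactions, min_sup, min_conf):
    """All rules (X, Y) with Y nonempty that meet both thresholds."""
    n = len(transactions)
    universe = frozenset().union(*transactions)
    rules = []
    for z in all_subsets(universe):
        cz = count(z, transactions)
        if not z or cz == 0 or Fraction(cz, n) < min_sup:
            continue
        for x in all_subsets(z):
            if x != z and Fraction(cz, count(x, transactions)) >= min_conf:
                rules.append((x, z - x))
    return rules


def representative_rules(transactions, min_sup, min_conf):
    """Valid rules that no other valid rule makes redundant (assumed cover notion)."""
    rules = valid_rules(transactions, min_sup, min_conf)

    def covered_by(rule, other):
        (x, y), (x2, y2) = rule, other
        return other != rule and x2 <= x and (x | y) <= (x2 | y2)

    return [r for r in rules if not any(covered_by(r, o) for o in rules)]


# Eight transactions contain a, b and c; two contain only d.
transactions = [frozenset("abc")] * 8 + [frozenset("d")] * 2
high = representative_rules(transactions, Fraction(1, 2), Fraction(9, 10))
low = representative_rules(transactions, Fraction(1, 2), Fraction(3, 4))
```

At confidence threshold 0.9 the rule ∅ → abc (confidence 0.8) is excluded and the three rules a → bc, b → ac and c → ab are representative. At threshold 0.75 the rule ∅ → abc is admitted and covers all the others, so it is the only representative rule left.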



6 Conclusions 

We have proposed an alternative (complete) solution for the generation of the set 
of all representative rules defined in [Kryszkiewicz, 1998b] (see Algorithm 1); we 
have also shown that the original algorithm was incomplete. Our approach, which 
seems to require more operations than the one in [Kryszkiewicz, 2001], has the 
advantage of being guaranteed to output the whole set of representative rules. 

On the other hand, one of its main drawbacks is that we cannot reuse the pre- 
computed values of the parameters once the user changes the confidence threshold. 
Our proposal for fixing this problem involves dividing the process into two phases 
(see Algorithm 2 and Algorithm 3). In conclusion, depending on whether one is 
interested in getting complete results or in getting them faster, it is more convenient to 
use Algorithm 1 or the algorithm in [Kryszkiewicz, 2001], respectively. 

We have also extended our approach to the similar but different basis corresponding 
to closure-based redundancy. Tests were performed in order to confirm that the 
algorithm is significantly faster than the previous two. 